Comparative Analysis of Dimensionality Reduction Techniques for Akasa Data v24.08.01
Author: Sindhu, Murshid
Created: June 4, 2024 | Modified: August 7, 2024

Objective

In this notebook, our objective is to explore high-dimensional Akasa data represented as embeddings generated through the text-embedding-3-small model. We will employ dimensionality reduction methods such as Sammon's mapping, t-SNE, and UMAP to transform these embeddings into a lower-dimensional space (1D, 2D, 3D) while preserving their intrinsic structure. Our aim is to compare these techniques using various metrics and to analyze how clusters are distributed across different industry categories.

Methods of Dimensionality Reduction

t-SNE (t-distributed Stochastic Neighbor Embedding)

t-SNE (t-distributed Stochastic Neighbor Embedding) is a technique for visualizing high-dimensional data by reducing it to 1, 2, or 3 dimensions while preserving local relationships between points. This transformation makes it easier to visualize and interpret complex data patterns.

  • n_components - The number of dimensions for the reduced embedding. For dimensionality reduction, common values are 2 or 3 for visualization. Using 1 dimension simplifies the data to a single line but may lose significant detail.

  • perplexity - Perplexity is a measure of the effective number of neighbors each point considers in its local neighborhood. It is one of the most influential hyperparameters in t-SNE and directly controls the balance between local and global structure. A low perplexity emphasizes local structure but can miss broader patterns; a high perplexity preserves global relationships but can blur local connections. The optimal value depends on the dataset size and its intrinsic dimensionality; values between 5 and 50 are common. We set perplexity to 50, which is adequate for capturing global structure while still maintaining reasonable detail in local clusters.

  • n_iter - The number of iterations determines how long the t-SNE algorithm runs to minimize the Kullback-Leibler divergence between the high-dimensional and low-dimensional representations of the data. This optimization is crucial for achieving a stable and meaningful embedding. We set n_iter to 1000, which provides enough iterations to reach a stable, well-optimized embedding for a dataset of this size.

  • learning_rate - In t-SNE, the learning rate controls the step size of the optimization as it minimizes the Kullback-Leibler divergence, i.e., how much the embedding is adjusted at each iteration. Typical values range from 200 to 1000. We set learning_rate to 1000, which allows effective updates for high-dimensional data such as these embeddings.

  • random_state - The random_state parameter sets the seed for the random number generator used during the algorithm's execution. This seed determines the starting point for generating random numbers, which influences the initial conditions of the algorithm.

UMAP (Uniform Manifold Approximation and Projection)

UMAP reduces dimensionality while maintaining both local and global data structures, making it suitable for understanding fine-grained and broad relationships. It is faster and more scalable than t-SNE, ideal for large datasets.

  • n_components - The number of dimensions for the reduced embedding. For dimensionality reduction, common values are 2 or 3 for visualization. Using 1 dimension simplifies the data to a single line but may lose significant detail.
  • n_neighbors - The number of nearest neighbors used to construct the local neighborhood for each point. Lower values capture more local structure, emphasizing finer details and local clusters. Higher values capture more global structure, blending local details into broader patterns. We set n_neighbors to 60, which balances local and global structure.

  • min_dist - Controls the minimum distance between embedded points in the lower-dimensional space. Lower values allow points to be closer together, preserving more local structure and detail. Higher values result in a more spread-out embedding, emphasizing the global structure and reducing local cluster tightness. Setting min_dist to 0.1 maintains some local structure while avoiding excessive crowding of points. This value helps ensure that points are reasonably close to each other in the embedding space without losing broader patterns.

  • random_state - Sets the seed for the random number generator to ensure reproducibility. Setting it to 42 makes the UMAP results consistent across runs, which is essential for reliable analysis and comparisons.

Sammon's Mapping

Sammon's Mapping is a technique that reduces the dimensionality of high-dimensional data while preserving inter-point distances as well as possible: it tries to keep the relative distances between points in the lower-dimensional space close to those in the original high-dimensional space, weighting errors on small distances more heavily. In this notebook it is approximated with scikit-learn's metric MDS, which minimizes a closely related (unweighted) stress function.

  • n_components - The number of dimensions for the reduced embedding. For dimensionality reduction, common values are 2 or 3 for visualization. Using 1 dimension simplifies the data to a single line but may lose significant detail.

  • max_iter - Specifies the maximum number of iterations for optimizing the embedding. It controls how many times the algorithm refines the positions of points to minimize the difference between distances in the original and reduced spaces. Setting max_iter to 1000 gives the algorithm enough iterations to converge to a stable and meaningful solution.
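Since scikit-learn does not provide Sammon's mapping directly, the quantity it minimizes can be written as a short sketch. The helper `sammon_stress` below is illustrative, not from the notebook; it computes Sammon's stress, which weights errors on small distances more heavily than plain MDS stress.

```python
import numpy as np
from scipy.spatial.distance import pdist

def sammon_stress(X_high, X_low, eps=1e-12):
    """Sammon's stress: distance-weighted mismatch between pairwise
    distances in the original and reduced spaces (lower is better)."""
    d_high = np.maximum(pdist(X_high), eps)  # original distances, guarded against /0
    d_low = pdist(X_low)                     # distances in the embedding
    return np.sum((d_high - d_low) ** 2 / d_high) / np.sum(d_high)

# A perfect embedding (identical pairwise distances) has zero stress.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))
print(sammon_stress(X, X))  # 0.0
```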

Evaluation Metrics

To compare and evaluate the effectiveness of dimensionality reduction techniques, we use several metrics.

1. Structure Preservation

  • Trustworthiness: Evaluates how well the local structure (k-nearest neighbors) is preserved after dimensionality reduction. Higher scores indicate better retention of local relationships.

  • Continuity: Measures how well original neighborhoods are retained in the reduced space. It checks whether each point's k-nearest neighbors in the original space remain among its neighbors after reduction.
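As a sketch of how these two scores can be computed: scikit-learn ships trustworthiness directly, and continuity can be obtained by swapping the roles of the two spaces. The synthetic matrices below are stand-ins for the real embeddings.

```python
import numpy as np
from sklearn.manifold import trustworthiness

rng = np.random.default_rng(42)
X_high = rng.normal(size=(200, 50))  # stand-in for the original embeddings
X_low = X_high[:, :2]                # a crude 2D "reduction" for illustration

# Trustworthiness: are neighbors in the reduced space genuine neighbors?
T = trustworthiness(X_high, X_low, n_neighbors=15)

# Continuity: are original neighbors kept in the reduced space?
# Obtained here by swapping the two spaces.
C = trustworthiness(X_low, X_high, n_neighbors=15)

print(f"T={T:.3f}, C={C:.3f}")
```

Both scores lie in [0, 1], with 1 meaning perfect preservation.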

2. Distance and Metric Quality

  • Root Mean Squared Error (RMSE) of Distances: Calculates the average deviation of pairwise distances between points in the original and reduced spaces. Lower RMSE values indicate better preservation of distances between points.
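A minimal sketch of this metric (the helper name `distance_rmse` is ours, not from the notebook):

```python
import numpy as np
from scipy.spatial.distance import pdist

def distance_rmse(X_high, X_low):
    """RMSE between pairwise distances in the original and reduced spaces."""
    return float(np.sqrt(np.mean((pdist(X_high) - pdist(X_low)) ** 2)))

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
print(distance_rmse(X, X))  # 0.0: identical spaces preserve all distances
```

Note that raw distances in the two spaces can sit on different scales, so this score is scale-sensitive and is best compared across techniques on the same data.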

3. Visualization Quality

  • Silhouette Score: Measures how well-separated clusters are by evaluating each point’s similarity to its own cluster versus other clusters. A higher score indicates better-defined clusters.

  • K-Nearest Neighbor (KNN) Retention: Measures the fraction of each point's k-nearest neighbors in the original space that are still among its k-nearest neighbors in the reduced space. A higher score indicates better preservation of local structure.
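A sketch of both scores; `knn_retention` is an illustrative helper (not a library function), and the random data and labels stand in for the real embeddings and clusters.

```python
import numpy as np
from sklearn.metrics import silhouette_score
from sklearn.neighbors import NearestNeighbors

def knn_retention(X_high, X_low, k=10):
    """Average fraction of each point's k nearest neighbors that are
    the same (by index) in the original and reduced spaces."""
    idx_high = NearestNeighbors(n_neighbors=k).fit(X_high).kneighbors(return_distance=False)
    idx_low = NearestNeighbors(n_neighbors=k).fit(X_low).kneighbors(return_distance=False)
    return float(np.mean([len(set(a) & set(b)) / k for a, b in zip(idx_high, idx_low)]))

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 8))
labels = rng.integers(1, 4, size=60)   # stand-in cluster labels (1..3)

S = silhouette_score(X, labels)        # in [-1, 1]; higher = better separated
R = knn_retention(X, X)                # identical spaces retain all neighbors
print(f"S={S:.3f}, R={R}")
```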

Aggregate Score

Aggregating evaluation metrics combines multiple performance measures into a single score to simplify comparison. We sum Trustworthiness (T), Continuity (C), Silhouette Score (S), and KNN Retention (R), subtract the Root Mean Squared Error (RMSE) to penalize distance distortion, and divide by 5 to normalize. The result is an equal-weighted average that favors techniques which preserve both local and global structure while minimizing distance distortion.

Aggregate Score = (T + C + S + R - RMSE) / 5

Akasa Embeddings

We will apply dimensionality reduction techniques to the Akasa data, followed by hierarchical clustering to identify clusters. The dimensionality reduction will transform the high-dimensional data into a lower-dimensional space, preserving essential structures and relationships. After this transformation, hierarchical clustering will group data points based on their proximity, forming a dendrogram. The fcluster function will then cut this dendrogram into at most 8 flat clusters (t=8 with criterion='maxclust'). Finally, we will evaluate the effectiveness of the dimensionality reduction and clustering using the metrics above.
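A minimal sketch of this clustering step on toy data (the notebook applies the same calls to the reduced embeddings):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

rng = np.random.default_rng(42)
points = rng.normal(size=(40, 2))              # toy stand-in for reduced embeddings

Z = linkage(pdist(points), method='ward')      # agglomerative clustering, Ward linkage
flat = fcluster(Z, t=8, criterion='maxclust')  # flatten into at most 8 clusters

print(flat.min(), flat.max())  # cluster ids start at 1 and go up to at most 8
```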

Data Loading and preprocessing

In [113]:
import numpy as np
import pandas as pd
import plotly.express as px
import time

# Load data from CSV file
csv_file_path = '/home/sindhu/Downloads/final_embedding_updated.csv'
data = pd.read_csv(csv_file_path)

# Extract name, labels, and embeddings
name = data['Name']
id = data['ID']  # note: this shadows the Python built-in id()
label = data['Label']
embeddings = data.drop(['Name', 'ID', 'Label'], axis=1).values

t-SNE for Akasa Embeddings

t-SNE - 1D

In [123]:
import time
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
from sklearn.manifold import TSNE
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist
import warnings

warnings.filterwarnings("ignore")

# Start time measurement
start_time = time.time()

# Apply t-SNE with 1D
tsne_embeddings_1D = TSNE(n_components=1, perplexity=50, learning_rate=1000, n_iter=1000, random_state=42).fit_transform(embeddings)

# Measure time after t-SNE
tsne_time = time.time()
print(f"t-SNE 1D: {tsne_time - start_time:.2f} seconds")

# Create DataFrame for t-SNE embeddings
tsne_result_data = {
    't-SNE Dimension 1': tsne_embeddings_1D[:, 0],
    'Name': name,
    'ID': id,
    'Labels': label
}
tsne_result_df = pd.DataFrame(tsne_result_data)

# Perform hierarchical clustering
X = tsne_result_df[['t-SNE Dimension 1']].values
distance_matrix = pdist(X, metric='euclidean')

Z = linkage(distance_matrix, method='ward')
cluster_labels_tsne_1D = fcluster(Z, t=8, criterion='maxclust')

# Add cluster labels to the DataFrame
tsne_result_df['TSNE_Cluster'] = cluster_labels_tsne_1D

# Save results to CSV
tsne_output_csv_path = 'tsne_industry_1D.csv'
tsne_result_df.to_csv(tsne_output_csv_path, index=False)

# Define the names of the nodes you want to highlight
nodes_to_highlight = ['Value Proposition', 'Seismic Imaging', 'Regional Evalution', 'Outcome Based Contract']

# Create a new column for the legend
tsne_result_df['Legend'] = 'Other Nodes'
tsne_result_df.loc[tsne_result_df['Name'].isin(nodes_to_highlight), 'Legend'] = 'Highlighted Nodes'

# Create interactive plot for 1D t-SNE embeddings
tsne_fig_1D = px.scatter(
    tsne_result_df, 
    x='t-SNE Dimension 1', 
    y=[0] * len(tsne_result_df),  
    hover_name='Name', 
    color='Legend',  
    title='t-SNE Projection (1D)',
    hover_data={'Labels': True},
    color_discrete_map={'Highlighted Nodes': 'red', 'Other Nodes': 'blue'}
)

# Add text annotations for the highlighted nodes
for node in nodes_to_highlight:
    node_data = tsne_result_df[tsne_result_df['Name'] == node]
    if not node_data.empty:
        tsne_fig_1D.add_annotation(
            x=node_data['t-SNE Dimension 1'].values[0],
            y=0,  # y is 0 for 1D
            text=node,
            showarrow=True,
            arrowhead=2,
            arrowsize=1,
            arrowwidth=2,
            arrowcolor="red",
            font=dict(size=10, color="red"),
            align="center",
            bgcolor="white",
            opacity=0.8
        )

# Update layout for better readability
tsne_fig_1D.update_layout(
    legend_title_text='Node Type',
    xaxis_title='t-SNE Dimension 1',
    yaxis_title='y'
)

# Save the DataFrame to a CSV file
tsne_output_csv_path = 'tsne_industry_1D.csv'
tsne_result_df.to_csv(tsne_output_csv_path, index=False)
t-SNE 1D: 3.84 seconds

t-SNE - 2D

In [34]:
import warnings
warnings.filterwarnings("ignore")

# Start time measurement
start_time = time.time()

# Apply t-SNE with 2D
tsne_embeddings_2D = TSNE(n_components=2, perplexity=50, learning_rate=1000, n_iter=1000, random_state=42).fit_transform(embeddings)

# Measure time after t-SNE
tsne_time = time.time()
print(f"t-SNE 2D: {tsne_time - start_time:.2f} seconds")

# Create DataFrame for t-SNE embeddings
tsne_result_data = {
    't-SNE Dimension 1': tsne_embeddings_2D[:, 0],
    't-SNE Dimension 2': tsne_embeddings_2D[:, 1],
    'Name': name,
    'ID': id,
    'Labels': label
}
tsne_result_df = pd.DataFrame(tsne_result_data)

# Perform hierarchical clustering
X = tsne_result_df[['t-SNE Dimension 1', 't-SNE Dimension 2']].values
distance_matrix = pdist(X, metric='euclidean')

Z = linkage(distance_matrix, method='ward')
cluster_labels_tsne_2D = fcluster(Z, t=8, criterion='maxclust')

# Add cluster labels to the DataFrame
tsne_result_df['TSNE_Cluster'] = cluster_labels_tsne_2D

# Save results to CSV
tsne_output_csv_path = 'tsne_industry_2D.csv'
tsne_result_df.to_csv(tsne_output_csv_path, index=False)

# Define the names of the nodes you want to highlight
nodes_to_highlight = ['Value Proposition', 'Seismic Imaging', 'Regional Evalution', 'Outcome Based Contract']

# Create a new column for the legend
tsne_result_df['Legend'] = 'Other Nodes'
tsne_result_df.loc[tsne_result_df['Name'].isin(nodes_to_highlight), 'Legend'] = 'Highlighted Nodes'

# Create interactive plot for 2D t-SNE embeddings
tsne_fig_2D = px.scatter(
    tsne_result_df, 
    x='t-SNE Dimension 1', 
    y='t-SNE Dimension 2',
    hover_name='Name', 
    color='Legend',  
    title='t-SNE Projection (2D)',
    hover_data={'Labels': True},
    color_discrete_map={'Highlighted Nodes': 'red', 'Other Nodes': 'blue'}
)

# Add text annotations for the highlighted nodes
for node in nodes_to_highlight:
    node_data = tsne_result_df[tsne_result_df['Name'] == node]
    if not node_data.empty:
        tsne_fig_2D.add_annotation(
            x=node_data['t-SNE Dimension 1'].values[0],
            y=node_data['t-SNE Dimension 2'].values[0],
            text=node,
            showarrow=True,
            arrowhead=2,
            arrowsize=1,
            arrowwidth=2,
            arrowcolor="red",
            font=dict(size=10, color="red"),
            align="center",
            bgcolor="white",
            opacity=0.8
        )

# Update layout for better readability
tsne_fig_2D.update_layout(
    legend_title_text='Node Type',
    xaxis_title='t-SNE Dimension 1',
    yaxis_title='t-SNE Dimension 2'
)

# Save the DataFrame to a CSV file
tsne_output_csv_path = 'tsne_industry_2D.csv'
tsne_result_df.to_csv(tsne_output_csv_path, index=False)
t-SNE 2D: 5.40 seconds

t-SNE - 3D

In [121]:
import time
import pandas as pd
import plotly.graph_objects as go
from sklearn.manifold import TSNE
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist
import warnings

warnings.filterwarnings("ignore")

# Start time measurement
start_time = time.time()

# Apply t-SNE with 3D
tsne_embeddings_3D = TSNE(n_components=3, perplexity=50, learning_rate=1000, n_iter=1000, random_state=42).fit_transform(embeddings)

# Measure time after t-SNE
tsne_time = time.time()
print(f"t-SNE 3D: {tsne_time - start_time:.2f} seconds")

# Create DataFrame for t-SNE embeddings
tsne_result_data = {
    't-SNE Dimension 1': tsne_embeddings_3D[:, 0],
    't-SNE Dimension 2': tsne_embeddings_3D[:, 1],
    't-SNE Dimension 3': tsne_embeddings_3D[:, 2],
    'Name': name,
    'ID': id,
    'Labels': label
}
tsne_result_df = pd.DataFrame(tsne_result_data)

# Perform hierarchical clustering
X = tsne_result_df[['t-SNE Dimension 1', 't-SNE Dimension 2', 't-SNE Dimension 3']].values
distance_matrix = pdist(X, metric='euclidean')

Z = linkage(distance_matrix, method='ward')
cluster_labels_tsne_3D = fcluster(Z, t=8, criterion='maxclust')

# Add cluster labels to the DataFrame
tsne_result_df['TSNE_Cluster'] = cluster_labels_tsne_3D

# Save results to CSV
tsne_output_csv_path = 'tsne_industry_3D.csv'
tsne_result_df.to_csv(tsne_output_csv_path, index=False)

# Define the names of the nodes you want to highlight
nodes_to_highlight = ['Value Proposition', 'Seismic Imaging', 'Regional Evalution', 'Outcome Based Contract']

# Create a new column for the legend
tsne_result_df['Legend'] = 'Other Nodes'
tsne_result_df.loc[tsne_result_df['Name'].isin(nodes_to_highlight), 'Legend'] = 'Highlighted Nodes'

# Create the interactive 3D scatter plot using go.Scatter3d
tsne_fig_3D = go.Figure()

# Add trace for other nodes
tsne_fig_3D.add_trace(
    go.Scatter3d(
        x=tsne_result_df[tsne_result_df['Legend'] == 'Other Nodes']['t-SNE Dimension 1'],
        y=tsne_result_df[tsne_result_df['Legend'] == 'Other Nodes']['t-SNE Dimension 2'],
        z=tsne_result_df[tsne_result_df['Legend'] == 'Other Nodes']['t-SNE Dimension 3'],
        mode='markers',
        marker=dict(size=2,color='blue'),  # Adjusted size for other nodes
        name='Other Nodes',

    )
)

# Add trace for highlighted nodes
tsne_fig_3D.add_trace(
    go.Scatter3d(
        x=tsne_result_df[tsne_result_df['Legend'] == 'Highlighted Nodes']['t-SNE Dimension 1'],
        y=tsne_result_df[tsne_result_df['Legend'] == 'Highlighted Nodes']['t-SNE Dimension 2'],
        z=tsne_result_df[tsne_result_df['Legend'] == 'Highlighted Nodes']['t-SNE Dimension 3'],
        mode='markers+text',
        marker=dict(size=6, color='red'),  # Larger size for highlighted nodes
        text=tsne_result_df[tsne_result_df['Legend'] == 'Highlighted Nodes']['Name'],
        textposition='top center',
        name='Highlighted Nodes'
    )
)

# Update layout for better readability
tsne_fig_3D.update_layout(
    showlegend=False,  
    title='t-SNE Projection (3D)',
    scene=dict(
        xaxis_title='t-SNE Dimension 1',
        yaxis_title='t-SNE Dimension 2',
        zaxis_title='t-SNE Dimension 3'
    )
)

UMAP for Akasa Embeddings

UMAP - 1D

In [124]:
import time
import pandas as pd
import umap.umap_ as umap
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist
import plotly.express as px
import plotly.graph_objects as go

# Start time measurement
start_time = time.time()

# Apply UMAP for 1D
umap_embeddings_1D = umap.UMAP(n_components=1, n_neighbors=60, min_dist=0.1, random_state=42).fit_transform(embeddings)

# Measure time after UMAP
umap_time = time.time()
print(f"UMAP 1D: {umap_time - start_time:.2f} seconds")

# Create DataFrame for UMAP embeddings
umap_result_data = {
    'UMAP Dimension 1': umap_embeddings_1D[:, 0],
    'Name': name,
    'ID': id,
    'Labels': label
}
umap_result_df = pd.DataFrame(umap_result_data)

# Perform hierarchical clustering
X_umap = umap_result_df[['UMAP Dimension 1']].values
distance_matrix_umap = pdist(X_umap, metric='euclidean')

Z_umap = linkage(distance_matrix_umap, method='ward')
cluster_labels_umap_1D = fcluster(Z_umap, t=8, criterion='maxclust')

# Add cluster labels to the UMAP DataFrame
umap_result_df['UMAP_Cluster'] = cluster_labels_umap_1D

# Save UMAP result to CSV
umap_output_csv_path = 'umap_industry_1D.csv'
umap_result_df.to_csv(umap_output_csv_path, index=False)

# Define the names of the nodes you want to highlight
nodes_to_highlight = ['Value Proposition', 'Seismic Imaging', 'Regional Evalution', 'Outcome Based Contract']

# Create a new column for the legend
umap_result_df['Legend'] = 'Other Nodes'
umap_result_df.loc[umap_result_df['Name'].isin(nodes_to_highlight), 'Legend'] = 'Highlighted Nodes'

# Create interactive plot for 1D UMAP embeddings
umap_fig_1D = px.scatter(
    umap_result_df, 
    x='UMAP Dimension 1', 
    y=[0] * len(umap_result_df),  
    hover_name='Name', 
    color='Legend',  
    title='UMAP Projection (1D)',
    hover_data={'Labels': True, 'UMAP_Cluster': True},
    color_discrete_map={'Highlighted Nodes': 'red', 'Other Nodes': 'blue'}
)

# Add text annotations for the highlighted nodes
for node in nodes_to_highlight:
    node_data = umap_result_df[umap_result_df['Name'] == node]
    if not node_data.empty:
        umap_fig_1D.add_annotation(
            x=node_data['UMAP Dimension 1'].values[0],
            y=0,  # y is 0 for 1D
            text=node,
            showarrow=True,
            arrowhead=2,
            arrowsize=1,
            arrowwidth=2,
            arrowcolor="red",
            font=dict(size=10, color="red"),
            align="center",
            bgcolor="white",
            opacity=0.8
        )

# Update layout for better readability
umap_fig_1D.update_layout(
    legend_title_text='Node Type',
    xaxis_title='UMAP Dimension 1',
    yaxis_title='y'
)

# Save the DataFrame to a CSV file
umap_output_csv_path = 'umap_industry_1D.csv'
umap_result_df.to_csv(umap_output_csv_path, index=False)
UMAP 1D: 5.48 seconds

UMAP - 2D

In [37]:
import time
import pandas as pd
import umap.umap_ as umap
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist
import plotly.express as px
import plotly.graph_objects as go

# Start time measurement
start_time = time.time()

# Apply UMAP for 2D
umap_embeddings_2D = umap.UMAP(n_components=2, n_neighbors=60, min_dist=0.1, random_state=42).fit_transform(embeddings)

# Measure time after UMAP
umap_time = time.time()
print(f"UMAP 2D: {umap_time - start_time:.2f} seconds")

# Create DataFrame for UMAP embeddings
umap_result_data = {
    'UMAP Dimension 1': umap_embeddings_2D[:, 0],
    'UMAP Dimension 2': umap_embeddings_2D[:, 1],
    'Name': name,
    'ID': id,
    'Labels': label
}
umap_result_df = pd.DataFrame(umap_result_data)

# Perform hierarchical clustering
X_umap = umap_result_df[['UMAP Dimension 1', 'UMAP Dimension 2']].values
distance_matrix_umap = pdist(X_umap, metric='euclidean')

Z_umap = linkage(distance_matrix_umap, method='ward')
cluster_labels_umap_2D = fcluster(Z_umap, t=8, criterion='maxclust')

# Add cluster labels to the UMAP DataFrame
umap_result_df['UMAP_Cluster'] = cluster_labels_umap_2D

# Save UMAP result to CSV
umap_output_csv_path = 'umap_industry_2D.csv'
umap_result_df.to_csv(umap_output_csv_path, index=False)

# Define the names of the nodes you want to highlight
nodes_to_highlight = ['Value Proposition', 'Seismic Imaging', 'Regional Evalution', 'Outcome Based Contract']

# Create a new column for the legend
umap_result_df['Legend'] = 'Other Nodes'
umap_result_df.loc[umap_result_df['Name'].isin(nodes_to_highlight), 'Legend'] = 'Highlighted Nodes'

# Create interactive plot for 2D UMAP embeddings
umap_fig_2D = px.scatter(
    umap_result_df, 
    x='UMAP Dimension 1', 
    y='UMAP Dimension 2',
    hover_name='Name', 
    color='Legend',  
    title='UMAP Projection (2D)',
    hover_data={'Labels': True, 'UMAP_Cluster': True},
    color_discrete_map={'Highlighted Nodes': 'red', 'Other Nodes': 'blue'}
)

# Add text annotations for the highlighted nodes
for node in nodes_to_highlight:
    node_data = umap_result_df[umap_result_df['Name'] == node]
    if not node_data.empty:
        umap_fig_2D.add_annotation(
            x=node_data['UMAP Dimension 1'].values[0],
            y=node_data['UMAP Dimension 2'].values[0],
            text=node,
            showarrow=True,
            arrowhead=2,
            arrowsize=1,
            arrowwidth=2,
            arrowcolor="red",
            font=dict(size=10, color="red"),
            align="center",
            bgcolor="white",
            opacity=0.8
        )

# Update layout for better readability
umap_fig_2D.update_layout(
    legend_title_text='Node Type',
    xaxis_title='UMAP Dimension 1',
    yaxis_title='UMAP Dimension 2'
)

# Save the DataFrame to a CSV file
umap_output_csv_path = 'umap_industry_2D.csv'
umap_result_df.to_csv(umap_output_csv_path, index=False)
UMAP 2D: 5.38 seconds

UMAP - 3D

In [104]:
import time
import pandas as pd
import plotly.graph_objects as go
import umap
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist
import warnings

warnings.filterwarnings("ignore")

# Start time measurement
start_time = time.time()

# Apply UMAP for 3D
umap_embeddings_3D = umap.UMAP(n_components=3, n_neighbors=60, min_dist=0.1, random_state=42).fit_transform(embeddings)

# Measure time after UMAP
umap_time = time.time()
print(f"UMAP 3D: {umap_time - start_time:.2f} seconds")

# Create DataFrame for UMAP embeddings
umap_result_data = {
    'UMAP Dimension 1': umap_embeddings_3D[:, 0],
    'UMAP Dimension 2': umap_embeddings_3D[:, 1],
    'UMAP Dimension 3': umap_embeddings_3D[:, 2],
    'Name': name,
    'ID': id,
    'Labels': label
}
umap_result_df = pd.DataFrame(umap_result_data)

# Perform hierarchical clustering
X_umap = umap_result_df[['UMAP Dimension 1', 'UMAP Dimension 2', 'UMAP Dimension 3']].values
distance_matrix_umap = pdist(X_umap, metric='euclidean')

Z_umap = linkage(distance_matrix_umap, method='ward')
cluster_labels_umap_3D = fcluster(Z_umap, t=8, criterion='maxclust')

# Add cluster labels to the UMAP DataFrame
umap_result_df['UMAP_Cluster'] = cluster_labels_umap_3D

# Save UMAP result to CSV
umap_output_csv_path = 'umap_industry_3D.csv'
umap_result_df.to_csv(umap_output_csv_path, index=False)

# Define the names of the nodes you want to highlight
nodes_to_highlight = ['Value Proposition', 'Seismic Imaging', 'Regional Evalution', 'Outcome Based Contract']

# Create a new column for the legend
umap_result_df['Legend'] = 'Other Nodes'
umap_result_df.loc[umap_result_df['Name'].isin(nodes_to_highlight), 'Legend'] = 'Highlighted Nodes'

# Create the interactive 3D scatter plot using go.Scatter3d
umap_fig_3D = go.Figure()

# Add trace for other nodes
umap_fig_3D.add_trace(
    go.Scatter3d(
        x=umap_result_df[umap_result_df['Legend'] == 'Other Nodes']['UMAP Dimension 1'],
        y=umap_result_df[umap_result_df['Legend'] == 'Other Nodes']['UMAP Dimension 2'],
        z=umap_result_df[umap_result_df['Legend'] == 'Other Nodes']['UMAP Dimension 3'],
        mode='markers',
        marker=dict(size=2, color='blue'),  # Adjusted size for other nodes
        name='Other Nodes'
    )
)

# Add trace for highlighted nodes
umap_fig_3D.add_trace(
    go.Scatter3d(
        x=umap_result_df[umap_result_df['Legend'] == 'Highlighted Nodes']['UMAP Dimension 1'],
        y=umap_result_df[umap_result_df['Legend'] == 'Highlighted Nodes']['UMAP Dimension 2'],
        z=umap_result_df[umap_result_df['Legend'] == 'Highlighted Nodes']['UMAP Dimension 3'],
        mode='markers+text',
        marker=dict(size=6, color='red'),  # Larger size for highlighted nodes
        text=umap_result_df[umap_result_df['Legend'] == 'Highlighted Nodes']['Name'],
        textposition='top center',
        name='Highlighted Nodes'
    )
)

# Update layout for better readability
umap_fig_3D.update_layout(
    title='UMAP Projection (3D)',
    scene=dict(
        xaxis_title='UMAP Dimension 1',
        yaxis_title='UMAP Dimension 2',
        zaxis_title='UMAP Dimension 3'
    ),
    showlegend=False,  # Hide the legend
    legend_title_text='Node Type'
)

# Save the DataFrame to a CSV file
umap_output_csv_path = 'umap_industry_3D.csv'
umap_result_df.to_csv(umap_output_csv_path, index=False)
UMAP 3D: 5.58 seconds

Sammon's Mapping for Akasa Embeddings

Sammon's Mapping - 1D

In [130]:
import pandas as pd
from sklearn.manifold import MDS
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist
import plotly.express as px
import time

# Start time measurement
start_time = time.time()

# Apply MDS for 1D
mds = MDS(n_components=1, max_iter=1000)
mds_embeddings_1D = mds.fit_transform(embeddings)

# Measure time after MDS
mds_time = time.time()
print(f"Sammon's 1D: {mds_time - start_time:.2f} seconds")

# Create DataFrame for MDS embeddings
mds_result_data_1D = {
    'Sammon Dimension 1': mds_embeddings_1D[:, 0],
    'Name': name,
    'ID': id,
    'Labels': label
}
mds_result_df_1D = pd.DataFrame(mds_result_data_1D)

# Perform hierarchical clustering
X_mds_1D = mds_result_df_1D[['Sammon Dimension 1']].values
distance_matrix_mds_1D = pdist(X_mds_1D, metric='euclidean')

Z_mds_1D = linkage(distance_matrix_mds_1D, method='ward')
cluster_labels_mds_1D = fcluster(Z_mds_1D, t=8, criterion='maxclust')

# Add cluster labels to the MDS DataFrame
mds_result_df_1D['MDS_Cluster'] = cluster_labels_mds_1D

# Save MDS result to CSV
mds_output_csv_path_1D = 'mds_industry_1D.csv'
mds_result_df_1D.to_csv(mds_output_csv_path_1D, index=False)

# Define the names of the nodes you want to highlight
nodes_to_highlight = ['Value Proposition', 'Seismic Imaging', 'Regional Evalution', 'Outcome Based Contract']

# Create a new column for the legend
mds_result_df_1D['Legend'] = 'Other Nodes'
mds_result_df_1D.loc[mds_result_df_1D['Name'].isin(nodes_to_highlight), 'Legend'] = 'Highlighted Nodes'

# Create interactive plot for MDS embeddings
mds_fig_1D = px.scatter(
    mds_result_df_1D, 
    x='Sammon Dimension 1', 
    y=[0] * len(mds_result_df_1D),  # Since it's 1D, y will be constant
    hover_name='Name', 
    color='Legend',  
    title='Sammon Projection (1D)',
    hover_data={'Labels': True, 'MDS_Cluster': True},
    color_discrete_map={'Highlighted Nodes': 'red', 'Other Nodes': 'blue'}
)

# Add text annotations for the highlighted nodes
for node in nodes_to_highlight:
    node_data = mds_result_df_1D[mds_result_df_1D['Name'] == node]
    if not node_data.empty:
        mds_fig_1D.add_annotation(
            x=node_data['Sammon Dimension 1'].values[0],
            y=0,
            text=node,
            showarrow=True,
            arrowhead=2,
            arrowsize=1,
            arrowwidth=2,
            arrowcolor="red",
            font=dict(size=10, color="red"),
            align="center",
            bgcolor="white",
            opacity=0.8
        )

# Update layout for better readability
mds_fig_1D.update_layout(
    legend_title_text='Node Type',
    xaxis_title='Sammon Dimension 1',
    yaxis_title='y'
)

Sammon's Mapping - 2D

In [71]:
import pandas as pd
from sklearn.manifold import MDS
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist
import plotly.express as px
import time

# Start time measurement
start_time = time.time()

# Apply MDS for 2D
mds = MDS(n_components=2, max_iter=1000)
mds_embeddings_2D = mds.fit_transform(embeddings)

# Measure time after MDS
mds_time = time.time()
print(f"Sammons 2D: {mds_time - start_time:.2f} seconds")

# Create DataFrame for MDS embeddings
mds_result_data = {
    'Sammons Dimension 1': mds_embeddings_2D[:, 0],
    'Sammons Dimension 2': mds_embeddings_2D[:, 1],
    'Name': name,
    'ID': id,
    'Labels': label
}
mds_result_df = pd.DataFrame(mds_result_data)

# Perform hierarchical clustering
X_mds = mds_result_df[['Sammons Dimension 1', 'Sammons Dimension 2']].values
distance_matrix_mds = pdist(X_mds, metric='euclidean')

Z_mds = linkage(distance_matrix_mds, method='ward')
cluster_labels_mds_2D = fcluster(Z_mds, t=8, criterion='maxclust')

# Add cluster labels to the MDS DataFrame
mds_result_df['MDS_Cluster'] = cluster_labels_mds_2D

# Save MDS result to CSV
mds_output_csv_path = 'mds_industry_2D.csv'
mds_result_df.to_csv(mds_output_csv_path, index=False)

# Define the names of the nodes you want to highlight
nodes_to_highlight = ['Value Proposition', 'Seismic Imaging', 'Regional Evalution', 'Outcome Based Contract']

# Create a new column for the legend
mds_result_df['Legend'] = 'Other Nodes'
mds_result_df.loc[mds_result_df['Name'].isin(nodes_to_highlight), 'Legend'] = 'Highlighted Nodes'

# Create interactive plot for MDS embeddings
mds_fig_2D = px.scatter(
    mds_result_df, 
    x='Sammons Dimension 1', 
    y='Sammons Dimension 2',
    hover_name='Name', 
    color='Legend',  
    title='Sammons Projection (2D)',
    hover_data={'Labels': True, 'MDS_Cluster': True},
    color_discrete_map={'Highlighted Nodes': 'red', 'Other Nodes': 'blue'}
)

# Add text annotations for the highlighted nodes
for node in nodes_to_highlight:
    node_data = mds_result_df[mds_result_df['Name'] == node]
    if not node_data.empty:
        mds_fig_2D.add_annotation(
            x=node_data['Sammons Dimension 1'].values[0],
            y=node_data['Sammons Dimension 2'].values[0],
            text=node,
            showarrow=True,
            arrowhead=2,
            arrowsize=1,
            arrowwidth=2,
            arrowcolor="red",
            font=dict(size=10, color="red"),
            align="center",
            bgcolor="white",
            opacity=0.8
        )

# Update layout for better readability
mds_fig_2D.update_layout(
    legend_title_text='Node Type',
    xaxis_title='Sammons Dimension 1',
    yaxis_title='Sammons Dimension 2'
)

Sammons 2D: 375.20 seconds
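Since scikit-learn ships no dedicated Sammon's mapping, metric MDS serves as a stand-in above. For reference, the core of a true Sammon's mapping is descent on the Sammon stress, which weights each pair's error by the inverse of its original distance. A minimal toy sketch follows; the function `sammon_sketch`, its hyperparameters, and the plain gradient-descent update are illustrative only (Sammon's 1969 algorithm uses a Newton-style update and typically a PCA initialization):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def sammon_sketch(X, n_components=2, n_iter=300, lr=0.1, seed=0):
    """Toy Sammon's mapping via plain gradient descent on the Sammon stress."""
    n = X.shape[0]
    D = squareform(pdist(X))
    c = D.sum() / 2.0                 # sum of distances over unique pairs
    np.fill_diagonal(D, 1.0)          # avoid division by zero on the diagonal
    Y = np.random.default_rng(seed).normal(scale=1e-2, size=(n, n_components))
    for _ in range(n_iter):
        d = squareform(pdist(Y))
        np.fill_diagonal(d, 1.0)
        ratio = (D - d) / (d * D)     # pairwise error, weighted by 1/D
        np.fill_diagonal(ratio, 0.0)
        diffs = Y[:, None, :] - Y[None, :, :]
        grad = (-2.0 / c) * (ratio[:, :, None] * diffs).sum(axis=1)
        Y -= lr * grad
    return Y

def stress(X, Y):
    """Classical Sammon stress over unique pairs."""
    D, d = pdist(X), pdist(Y)
    return np.sum((D - d) ** 2 / D) / D.sum()

X = np.random.default_rng(1).normal(size=(30, 8))
Y0 = np.random.default_rng(0).normal(scale=1e-2, size=(30, 2))  # same init the sketch uses
Y = sammon_sketch(X)
print(stress(X, Y) < stress(X, Y0))  # True: descent lowers the stress
```

Unlike the unweighted MDS objective, the 1/D weighting makes small original distances count more, which is why Sammon's mapping tends to preserve local structure better than plain metric MDS.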

Sammon's Mapping - 3D

In [109]:
import pandas as pd
from sklearn.manifold import MDS
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist
import plotly.graph_objects as go
import time

# Start time measurement
start_time = time.time()

# Apply MDS for 3D
mds = MDS(n_components=3, max_iter=1000)
mds_embeddings_3D = mds.fit_transform(embeddings)

# Measure time after MDS
mds_time = time.time()
print(f"MDS 3D: {mds_time - start_time:.2f} seconds")

# Create DataFrame for MDS embeddings
mds_result_data_3D = {
    'MDS Dimension 1': mds_embeddings_3D[:, 0],
    'MDS Dimension 2': mds_embeddings_3D[:, 1],
    'MDS Dimension 3': mds_embeddings_3D[:, 2],
    'Name': name,
    'ID': id,
    'Labels': label
}
mds_result_df_3D = pd.DataFrame(mds_result_data_3D)

# Perform hierarchical clustering
X_mds_3D = mds_result_df_3D[['MDS Dimension 1', 'MDS Dimension 2', 'MDS Dimension 3']].values
distance_matrix_mds_3D = pdist(X_mds_3D, metric='euclidean')

Z_mds_3D = linkage(distance_matrix_mds_3D, method='ward')
cluster_labels_mds_3D = fcluster(Z_mds_3D, t=8, criterion='maxclust')

# Add cluster labels to the MDS DataFrame
mds_result_df_3D['MDS_Cluster'] = cluster_labels_mds_3D

# Save MDS result to CSV
mds_output_csv_path_3D = 'mds_industry_3D.csv'
mds_result_df_3D.to_csv(mds_output_csv_path_3D, index=False)

# Define the names of the nodes you want to highlight
nodes_to_highlight = ['Value Proposition', 'Seismic Imaging', 'Regional Evalution', 'Outcome Based Contract']

# Create a new column for the legend
mds_result_df_3D['Legend'] = 'Other Nodes'
mds_result_df_3D.loc[mds_result_df_3D['Name'].isin(nodes_to_highlight), 'Legend'] = 'Highlighted Nodes'

# Create 3D scatter plot for MDS embeddings
mds_fig_3D = go.Figure()

# Add scatter plot for all nodes
mds_fig_3D.add_trace(go.Scatter3d(
    x=mds_result_df_3D['MDS Dimension 1'],
    y=mds_result_df_3D['MDS Dimension 2'],
    z=mds_result_df_3D['MDS Dimension 3'],
    mode='markers',
    marker=dict(
        size=2,
        color=mds_result_df_3D['Legend'].map({'Highlighted Nodes': 'red', 'Other Nodes': 'blue'}),
        opacity=0.8
    ),
    text=mds_result_df_3D['Name'],
    hoverinfo='text'
))

# Add text annotations for the highlighted nodes
for node in nodes_to_highlight:
    node_data = mds_result_df_3D[mds_result_df_3D['Name'] == node]
    if not node_data.empty:
        mds_fig_3D.add_trace(go.Scatter3d(
            x=[node_data['MDS Dimension 1'].values[0]],
            y=[node_data['MDS Dimension 2'].values[0]],
            z=[node_data['MDS Dimension 3'].values[0]],
            mode='markers+text',
            marker=dict(size=6, color='red'),  # Larger size for highlighted nodes
            text=[node],
            textposition='top center'
        ))

# Update layout for better readability
mds_fig_3D.update_layout(
    title='Sammons Projection (3D)',
    scene=dict(
        xaxis_title='Sammons Dimension 1',
        yaxis_title='Sammons Dimension 2',
        zaxis_title='Sammons Dimension 3'
    ),
    showlegend=False  # Hide the legend
)

Evaluation Metrics

Trustworthiness

In [48]:
# Define the embeddings and methods; note that this rebinds the name
# `embeddings`, keeping the original high-dimensional array under 'original'
methods = ['tsne', 'umap', 'sammon']
dims = ['1D', '2D', '3D']
embeddings = {
    'original': embeddings,  # the original high-dimensional embeddings
    'tsne': [tsne_embeddings_1D, tsne_embeddings_2D, tsne_embeddings_3D],
    'umap': [umap_embeddings_1D, umap_embeddings_2D, umap_embeddings_3D],
    'sammon': [mds_embeddings_1D, mds_embeddings_2D, mds_embeddings_3D]
}
In [49]:
from sklearn.manifold import trustworthiness

# Calculate and print trustworthiness scores
for method in methods:
    for dim in dims:
        idx = dims.index(dim)
        trust_score = trustworthiness(embeddings['original'], embeddings[method][idx], n_neighbors=5)
        print(f"{method.capitalize()} Trustworthiness ({dim}): {trust_score: .4f}")
Tsne Trustworthiness (1D):  0.9249
Tsne Trustworthiness (2D):  0.9563
Tsne Trustworthiness (3D):  0.7977
Umap Trustworthiness (1D):  0.8209
Umap Trustworthiness (2D):  0.9205
Umap Trustworthiness (3D):  0.9487
Sammon Trustworthiness (1D):  0.6037
Sammon Trustworthiness (2D):  0.7877
Sammon Trustworthiness (3D):  0.8509
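Trustworthiness asks whether points that appear as neighbors in the low-dimensional map were also neighbors in the original space, penalizing intruders by their original-space rank; a score of 1 means no intruders. A quick sanity check on toy data (not the Akasa embeddings) illustrates both extremes:

```python
import numpy as np
from sklearn.manifold import trustworthiness

# An identity "reduction" preserves every neighborhood exactly
X = np.random.default_rng(0).normal(size=(40, 6))
print(trustworthiness(X, X, n_neighbors=5))  # 1.0

# Destroying the row correspondence scrambles the neighborhoods
shuffled = X[np.random.default_rng(1).permutation(40)]
print(trustworthiness(X, shuffled, n_neighbors=5) < 1.0)  # True
```

Note that scikit-learn requires `n_neighbors < n_samples / 2`, which is comfortably satisfied by `n_neighbors=5` on our dataset.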

Continuity

In [50]:
from sklearn.neighbors import NearestNeighbors
import numpy as np

def continuity(original_data, reduced_data, k=5):
    """
    Neighborhood-overlap approximation of continuity: the fraction of each
    point's k nearest original-space neighbors that remain among its k
    nearest neighbors after reduction. (The classical rank-based continuity
    metric additionally weights lost neighbors by their rank.)
    """
    # Find K nearest neighbors in the original high-dimensional space
    nbrs_original = NearestNeighbors(n_neighbors=k+1, algorithm='auto').fit(original_data)
    _, indices_original = nbrs_original.kneighbors(original_data)
    
    # Find K nearest neighbors in the reduced-dimensional space
    nbrs_reduced = NearestNeighbors(n_neighbors=k+1, algorithm='auto').fit(reduced_data)
    _, indices_reduced = nbrs_reduced.kneighbors(reduced_data)
    
    # Exclude the point itself from the neighbor list
    indices_original = indices_original[:, 1:]
    indices_reduced = indices_reduced[:, 1:]
    
    # Compute continuity rate
    continuity_rates = [
        len(set(indices_original[i]).intersection(indices_reduced[i])) / k
        for i in range(len(original_data))
    ]
    
    return np.mean(continuity_rates)

# Calculate and print continuity metrics
for method in methods:
    for dim in dims:
        idx = dims.index(dim)
        continuity_score = continuity(embeddings['original'], embeddings[method][idx], k=5)
        print(f"{method.capitalize()} Continuity ({dim}): {continuity_score: .4f}")
Tsne Continuity (1D):  0.3510
Tsne Continuity (2D):  0.5575
Tsne Continuity (3D):  0.2803
Umap Continuity (1D):  0.0937
Umap Continuity (2D):  0.2932
Umap Continuity (3D):  0.3857
Sammon Continuity (1D):  0.0244
Sammon Continuity (2D):  0.1279
Sammon Continuity (3D):  0.1895
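The overlap score above is stricter than classical rank-based continuity, which also gives partial credit to near-misses. Conveniently, rank-based continuity is exactly trustworthiness with its two arguments swapped: neighborhoods are taken in the original space, and neighbors lost in the projection are penalized by their reduced-space rank. It can therefore be computed with no new code; a sketch on toy data:

```python
import numpy as np
from sklearn.manifold import trustworthiness

def rank_continuity(original, reduced, n_neighbors=5):
    # Continuity penalizes original-space neighbors that are lost in the
    # reduced space, ranked by reduced-space distance: this is
    # trustworthiness with the two spaces swapped
    return trustworthiness(reduced, original, n_neighbors=n_neighbors)

X = np.random.default_rng(0).normal(size=(40, 6))
print(rank_continuity(X, X))  # identity reduction: 1.0
```

Because rank-based continuity rewards near-misses, its values will generally sit above the overlap-based scores printed above for the same embeddings.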

Silhouette Score

In [52]:
from sklearn.metrics import silhouette_score

# Define your methods and dimensions
methods = ['tsne', 'umap', 'sammon']
dims = ['1D', '2D', '3D']

cluster_labels = {
    'tsne': [cluster_labels_tsne_1D, cluster_labels_tsne_2D, cluster_labels_tsne_3D],
    'umap': [cluster_labels_umap_1D, cluster_labels_umap_2D, cluster_labels_umap_3D],
    'sammon': [cluster_labels_mds_1D, cluster_labels_mds_2D, cluster_labels_mds_3D]
}

# Calculate and print normalized silhouette scores
for method in methods:
    for dim in dims:
        idx = dims.index(dim)
        # Calculate silhouette score
        score = silhouette_score(embeddings[method][idx], cluster_labels[method][idx])
        # Normalize the silhouette score to the range [0, 1]
        normalized_score = (score + 1) / 2
        print(f"{method.capitalize()} Silhouette ({dim}): {normalized_score:.4f}")
Tsne Silhouette (1D): 0.7490
Tsne Silhouette (2D): 0.6659
Tsne Silhouette (3D): 0.5754
Umap Silhouette (1D): 0.7653
Umap Silhouette (2D): 0.6846
Umap Silhouette (3D): 0.6648
Sammon Silhouette (1D): 0.7568
Sammon Silhouette (2D): 0.6399
Sammon Silhouette (3D): 0.6105
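One caveat: the silhouette here is computed on the reduced embeddings against clusters that were themselves derived from those embeddings, so it measures cluster compactness in the projection rather than agreement with external labels. The `(s + 1) / 2` normalization is a simple affine rescaling of the usual [-1, 1] range; a toy check (illustrative data, not the Akasa embeddings) shows it approaching 1 for well-separated clusters:

```python
import numpy as np
from sklearn.metrics import silhouette_score

# Two tight, well-separated 2D blobs: raw silhouette is near 1,
# so the normalized score (s + 1) / 2 is near 1 as well
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.1, size=(20, 2)),
               rng.normal(10.0, 0.1, size=(20, 2))])
labels = np.array([0] * 20 + [1] * 20)
normalized = (silhouette_score(X, labels) + 1) / 2
print(normalized > 0.95)  # True
```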

RMSE of Distances

In [53]:
import numpy as np
from sklearn.metrics import mean_squared_error
from scipy.spatial.distance import pdist, squareform
from sklearn.preprocessing import MinMaxScaler

# Compute pairwise distances in original space
original_distances = squareform(pdist(embeddings['original'], metric='euclidean'))

# Initialize lists to store RMSE values
rmse_values = []

# Calculate RMSE for each method and dimension
for method in methods:
    for dim in dims:
        idx = dims.index(dim)
        reduced_distances = squareform(pdist(embeddings[method][idx], metric='euclidean'))
        rmse = np.sqrt(mean_squared_error(original_distances, reduced_distances))
        rmse_values.append(rmse)

# Normalize RMSE values to the range (0, 1)
scaler = MinMaxScaler()
rmse_values_reshaped = np.array(rmse_values).reshape(-1, 1)  # Reshape for scaler
rmse_normalized = scaler.fit_transform(rmse_values_reshaped).flatten()

# Print original and normalized RMSE values
print("\nOriginal and Normalized RMSE values:")
for i, method in enumerate(methods):
    for j, dim in enumerate(dims):
        idx = i * len(dims) + j
        print(f"{method.capitalize()} RMSE ({dim}): {rmse_values[idx]:.4f}, Normalized: {rmse_normalized[idx]:.4f}")
Original and Normalized RMSE values:
Tsne RMSE (1D): 28.8425, Normalized: 0.1555
Tsne RMSE (2D): 64.8666, Normalized: 0.3521
Tsne RMSE (3D): 183.5711, Normalized: 1.0000
Umap RMSE (1D): 5.9294, Normalized: 0.0304
Umap RMSE (2D): 2.5188, Normalized: 0.0118
Umap RMSE (3D): 2.0299, Normalized: 0.0091
Sammon RMSE (1D): 0.6640, Normalized: 0.0017
Sammon RMSE (2D): 0.4618, Normalized: 0.0006
Sammon RMSE (3D): 0.3577, Normalized: 0.0000
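A computational caveat: the `squareform` matrices used above count every pair twice and include the all-zero diagonal, which slightly deflates the mean squared error (by a factor of (n-1)/n). Computing the RMSE directly on the condensed `pdist` vectors uses each unique pair exactly once. A sketch with a stand-in "reduction" (truncation to the first two coordinates, purely illustrative):

```python
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
X_high = rng.normal(size=(50, 8))
X_low = X_high[:, :2]        # stand-in "reduction": keep the first two axes

d_high = pdist(X_high)       # condensed form: one entry per unique pair
d_low = pdist(X_low)
rmse = np.sqrt(np.mean((d_high - d_low) ** 2))
print(rmse > 0)  # True: truncation shortens most pairwise distances
```

The ranking of methods is unaffected by the choice of form, since the deflation factor is the same for every method.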

K-Nearest Neighbor (KNN) Retention

In [54]:
from sklearn.neighbors import NearestNeighbors
import numpy as np
 
# Function to calculate KNN retention
def knn_retention(high_dim_embeddings, low_dim_embeddings, k=5):
    # Note: with n_neighbors=k each query point is returned as its own
    # nearest neighbor, so a guaranteed self-match contributes exactly 1/k
    # to every score; comparisons between methods are unaffected
    # Find k-nearest neighbors in high-dimensional space
    high_dim_nn = NearestNeighbors(n_neighbors=k).fit(high_dim_embeddings)
    high_dim_neighbors = high_dim_nn.kneighbors(high_dim_embeddings, return_distance=False)
    # Find k-nearest neighbors in low-dimensional space
    low_dim_nn = NearestNeighbors(n_neighbors=k).fit(low_dim_embeddings)
    low_dim_neighbors = low_dim_nn.kneighbors(low_dim_embeddings, return_distance=False)
    # Calculate overlap
    total_overlap = 0
    for i in range(high_dim_embeddings.shape[0]):
        overlap = np.intersect1d(high_dim_neighbors[i], low_dim_neighbors[i]).size
        total_overlap += overlap
    # Calculate average retention
    avg_retention = total_overlap / (high_dim_embeddings.shape[0] * k)
    return avg_retention

# Compute and print KNN retention for each method and dimension
for method in methods:
    for dim in dims:
        reduced_embedding = embeddings[method][dims.index(dim)]
        knn_ret_value = knn_retention(embeddings['original'], reduced_embedding)
        print(f"KNN retention between original and {method} ({dim}): {knn_ret_value:.4f}")
KNN retention between original and tsne (1D): 0.4860
KNN retention between original and tsne (2D): 0.6695
KNN retention between original and tsne (3D): 0.4444
KNN retention between original and umap (1D): 0.2662
KNN retention between original and umap (2D): 0.4241
KNN retention between original and umap (3D): 0.5085
KNN retention between original and sammon (1D): 0.2202
KNN retention between original and sammon (2D): 0.3113
KNN retention between original and sammon (3D): 0.3672
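Because `NearestNeighbors` returns each query point as its own nearest neighbor, every score above contains a guaranteed self-match. A variant that excludes the point itself, mirroring the continuity function earlier in the notebook, is sketched below (the helper name `knn_retention_no_self` is illustrative):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_retention_no_self(high, low, k=5):
    # Query k+1 neighbors and drop the first column (the point itself)
    hi = NearestNeighbors(n_neighbors=k + 1).fit(high)
    lo = NearestNeighbors(n_neighbors=k + 1).fit(low)
    hi_idx = hi.kneighbors(high, return_distance=False)[:, 1:]
    lo_idx = lo.kneighbors(low, return_distance=False)[:, 1:]
    overlap = sum(np.intersect1d(a, b).size for a, b in zip(hi_idx, lo_idx))
    return overlap / (high.shape[0] * k)

X = np.random.default_rng(0).normal(size=(40, 6))
print(knn_retention_no_self(X, X))  # identity reduction retains all neighbors: 1.0
```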

Sammon's Stress

In [63]:
from scipy.spatial.distance import pdist
import numpy as np

# Reuse the `embeddings` dict built in the Trustworthiness section.
# (Rebuilding it here with 'original': embeddings would nest the dict
# inside itself, since `embeddings` has already been rebound to the dict.)
methods = ['tsne', 'umap', 'sammon']
dims = ['1D', '2D', '3D']

# Calculate pairwise distances in the original space
original_distances = pdist(embeddings['original'], metric='euclidean')

# Define a Sammon-style stress function
def sammon_stress(original_distances, projected_distances):
    # Filter out zero distances to avoid division by zero
    mask = original_distances > 0
    original_distances = original_distances[mask]
    projected_distances = projected_distances[mask]
    
    # Note: classical Sammon (1969) stress is sum((d - p)^2 / d) / sum(d).
    # The variant below squares the *relative* error and normalizes by the
    # sum of squared distances, so its values are on a different scale,
    # but lower still means better distance preservation
    numerator = np.sum(((original_distances - projected_distances) / original_distances) ** 2)
    denominator = np.sum(original_distances ** 2)
    stress = numerator / denominator
    return stress

# Calculate Sammon's Stress for each method and dimension
for method in methods:
    for dim in dims:
        # Get the projected embeddings
        projected_embeddings = embeddings[method][dims.index(dim)]
        
        # Calculate pairwise distances for projected embeddings
        projected_distances = pdist(projected_embeddings, metric='euclidean')
        
        # Calculate stress
        stress = sammon_stress(original_distances, projected_distances)
        print(f"Sammon's Stress ({method.upper()} {dim}): {stress:.4f}")
Sammon's Stress (TSNE 1D): 419.7055
Sammon's Stress (TSNE 2D): 2117.6760
Sammon's Stress (TSNE 3D): 17768.0276
Sammon's Stress (UMAP 1D): 17.6705
Sammon's Stress (UMAP 2D): 3.1937
Sammon's Stress (UMAP 3D): 2.0653
Sammon's Stress (SAMMON 1D): 0.2411
Sammon's Stress (SAMMON 2D): 0.1185
Sammon's Stress (SAMMON 3D): 0.0712

Sammon's Mapping has the lowest stress values across all three target dimensionalities, indicating it best preserves the pairwise distances of the high-dimensional space. This is expected: the MDS solver used as its stand-in directly minimizes a stress objective, whereas t-SNE and UMAP optimize neighborhood structure rather than raw distances.
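A quick sanity check that applies to any stress implementation: a distance-preserving projection must score exactly zero, and distortion is penalized. The snippet below uses the classical Sammon (1969) definition for reference (toy data, not the Akasa embeddings; note the notebook's own function normalizes differently, so absolute values are not comparable):

```python
import numpy as np
from scipy.spatial.distance import pdist

def sammon_stress_classic(d_orig, d_proj):
    # Classical Sammon stress: sum over unique pairs of (d - p)^2 / d,
    # normalized by the sum of original distances
    mask = d_orig > 0
    d, p = d_orig[mask], d_proj[mask]
    return np.sum((d - p) ** 2 / d) / np.sum(d)

X = np.random.default_rng(0).normal(size=(25, 5))
d = pdist(X)
print(sammon_stress_classic(d, d))          # 0.0: perfect preservation
print(sammon_stress_classic(d, 2 * d) > 0)  # True: uniform stretching is penalized
```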

Visualization of 1D Projection

In [129]:
tsne_fig_1D.show(renderer='notebook')

umap_fig_1D.show(renderer='notebook')
mds_fig_1D.show(renderer='notebook')

Visualization of 2D Projection

In [74]:
tsne_fig_2D.show(renderer='notebook')
umap_fig_2D.show(renderer='notebook')
mds_fig_2D.show(renderer='notebook')

Visualization of 3D Projection

In [122]:
tsne_fig_3D.show(renderer='notebook')
umap_fig_3D.show(renderer='notebook')
mds_fig_3D.show(renderer='notebook')

Conclusion

| Metric | t-SNE (1D) | UMAP (1D) | Sammon's Mapping (1D) | t-SNE (2D) | UMAP (2D) | Sammon's Mapping (2D) | t-SNE (3D) | UMAP (3D) | Sammon's Mapping (3D) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Trustworthiness | 0.9249 | 0.8209 | 0.6037 | 0.9563 | 0.9205 | 0.7877 | 0.7977 | 0.9487 | 0.8509 |
| Continuity | 0.3510 | 0.0937 | 0.0244 | 0.5575 | 0.2932 | 0.1279 | 0.2803 | 0.3857 | 0.1895 |
| Silhouette Score | 0.7490 | 0.7653 | 0.7568 | 0.6659 | 0.6846 | 0.6399 | 0.5754 | 0.6648 | 0.6105 |
| RMSE of Distances (normalized) | 0.1555 | 0.0304 | 0.0017 | 0.3521 | 0.0118 | 0.0006 | 1.0000 | 0.0091 | 0.0000 |
| K-Nearest Neighbor (KNN) Retention | 0.4860 | 0.2662 | 0.2202 | 0.6695 | 0.4241 | 0.3113 | 0.4444 | 0.5085 | 0.3672 |
| Aggregate Score | 0.4711 | 0.3831 | 0.3206 | 0.4994 | 0.4621 | 0.3732 | 0.2195 | 0.4997 | 0.4036 |

Table 1: Evaluation Scores
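The notebook does not show how the Aggregate Score row is computed. The values are consistent, up to rounding of the displayed inputs, with an unweighted mean in which normalized RMSE (where lower is better) enters with a negative sign. The reconstruction below is an inference from Table 1, not code from the notebook:

```python
# Inferred aggregation: mean of the five metrics with normalized RMSE
# subtracted, since lower RMSE is better. Reconstructed from Table 1;
# the notebook's own aggregation code is not shown.
def aggregate_score(trust, continuity, silhouette, rmse_norm, knn):
    return (trust + continuity + silhouette + knn - rmse_norm) / 5

# UMAP (3D) row of Table 1 reproduces the published 0.4997
print(round(aggregate_score(0.9487, 0.3857, 0.6648, 0.0091, 0.5085), 4))  # 0.4997
```

The same formula reproduces the t-SNE (1D) score of 0.4711; a few cells differ in the fourth decimal place, consistent with the metrics having been aggregated before rounding.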

UMAP (3D) attains the highest aggregate score, performing particularly well in trustworthiness, continuity, and KNN retention while keeping a low normalized RMSE (0.0091; only Sammon's Mapping preserves distances better). This makes it a robust overall representation.

t-SNE (2D) is a close second, with the highest trustworthiness, continuity, and KNN retention of any configuration. Although its RMSE is higher than that of UMAP (3D), meaning it distorts absolute distances more, it still captures the neighborhood structure of the data effectively.